Runtime home mapping for effective memory resource usage
Abstract
In tiled chip multiprocessors (CMPs), last-level cache (LLC) banks are usually shared but distributed among the tiles. A static mapping of cache blocks to LLC banks is inefficient because a block may be mapped far from the tiles that actually access it. Existing dynamic policies either rely on a static mapping of blocks to a set of banks (D-NUCA) or rely on the OS to dynamically load pages to statically mapped addresses (first-touch). In this paper, we propose Runtime Home Mapping (RHM), a new dynamic approach in which the memory controller determines the LLC home bank at runtime, when the block is fetched from main memory. RHM tries to map each block as close as possible to the requestor, which reduces execution time and message latencies. Block migration and replication provide further improvements over basic RHM, and a further optimization eliminates the directory structure. All of these optimizations involve specific NoC optimizations and co-design. Results with PARSEC and SPLASH2 applications, evaluated against alternative solutions, show that RHM achieves average reductions of 41% and 35% in load and store latencies, respectively, relative to static mapping. This leads to an average reduction of 28% in application execution time.
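The core idea, choosing a home bank near the requestor when a block is fetched, can be illustrated with a minimal sketch. The abstract does not say how the memory controller picks the bank, so everything here is an assumption made for illustration: the 2D mesh layout, row-major tile numbering, Manhattan hop distance, the `free_ways` occupancy list, and the tie-break by lowest tile id. The function names are hypothetical.

```python
from typing import List, Optional, Tuple


def tile_coords(tile: int, mesh_width: int) -> Tuple[int, int]:
    """Map a tile id to (x, y), assuming row-major numbering on a 2D mesh."""
    return tile % mesh_width, tile // mesh_width


def hop_distance(a: int, b: int, mesh_width: int) -> int:
    """Manhattan distance between two tiles (an assumed latency proxy)."""
    ax, ay = tile_coords(a, mesh_width)
    bx, by = tile_coords(b, mesh_width)
    return abs(ax - bx) + abs(ay - by)


def choose_home_bank(requestor: int, free_ways: List[int],
                     mesh_width: int) -> Optional[int]:
    """Pick the closest LLC bank that still has room for the block.

    free_ways[t] is a hypothetical count of free ways in tile t's bank.
    Returns None when no bank has space; a real design would then need
    a replacement decision, which this sketch does not model.
    """
    candidates = [t for t, free in enumerate(free_ways) if free > 0]
    if not candidates:
        return None
    # Nearest bank first; ties are broken by the lowest tile id.
    return min(candidates,
               key=lambda t: (hop_distance(requestor, t, mesh_width), t))
```

For example, on a 4x4 mesh where every bank has space, `choose_home_bank(5, [1] * 16, 4)` selects the requestor's own tile; if that bank is full, one of its one-hop neighbours is chosen instead.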
Similar resources
Garbage Collector Memory Accounting in Language-Based Systems
Language run-time systems are often called upon to safely execute mutually distrustful tasks within the same runtime, protecting them from other tasks’ bugs or otherwise hostile behavior. Well-studied access controls exist in systems such as Java to prevent unauthorized reading or writing of data, but techniques to measure and control resource usage are less prevalent. In particular, most langu...
Low Latency and Memory Efficient Viterbi Decoder Using Modified State-Mapping Method
In this paper, a new implementation of the Viterbi decoder is proposed. The Modified State-Mapping VD algorithm combines the TB algorithm with the RE algorithm. By updating the starting point of the state for each memory bank, and by using Trace Back and Trace Forward information, LIFO (Last Input First Output) operation can be eliminated, which reduces the latency of the TB algorithm and decre...
Cutless FPGA Mapping
The paper presents a new algorithm for FPGA technology mapping into K-input LUTs. The algorithm avoids cut enumeration by incrementally computing and updating one good K-feasible cut at each node of the subject graph. The main advantage of the algorithm is that it works for very large LUT size while offering dramatic improvements in memory and runtime. For 10-input LUTs, the memory is reduced 2...
Workload Characteristics of a Multi-cluster Supercomputer
This paper presents a comprehensive characterization of a multi-cluster supercomputer workload using twelve-month scientific research traces. Metrics that we characterize include system utilization, job arrival rate and interarrival time, job cancellation rate, job size (degree of parallelism), job run time, memory usage, and user/group behavior. Correlations between metrics (job runtime and me...
SPMPool: Runtime SPM Management for Embedded Many-Cores
Distributed scratchpad memories (SPM) in embedded many-core systems require careful selection of data placement such that good performance can be achieved. In this paper, we propose SPMPool to share the available on-chip scratchpads on many-cores among executing applications in order to reduce the overall memory access latency. By pooling SPM resources, we can assign underutilized memory resour...
Journal:
- Microprocessors and Microsystems - Embedded Hardware Design
Volume 38, Issue
Pages -
Publication date 2014